Abstract: In the classic multi-armed bandits problem, the goal 
is to have a policy for dynamically operating arms that each 
yield stochastic rewards with unknown means. The key metric 
of interest is regret, defined as the gap between the expected 
total reward accumulated by an omniscient player that knows 
the reward means for each arm, and the expected total reward 
accumulated by the given policy. The policies presented in prior 
work have storage, computation and regret all growing linearly 
with the number of arms, which is not scalable when the number 
of arms is large. We consider in this work a broad class of multi- 
armed bandits with dependent arms that yield rewards as a linear 
combination of a set of unknown parameters. For this general 
framework, we present efficient policies that are shown to achieve 
regret that grows logarithmically with time, and polynomially in 
the number of unknown parameters (even though the number 
of dependent arms may grow exponentially). Furthermore, these 
policies only require storage that grows linearly in the number of 
unknown parameters. We show that this generalization is broadly 
applicable and useful for many interesting tasks in networks 
that can be formulated as tractable combinatorial optimization 
problems with linear objective functions, such as maximum 
weight matching, shortest path, and minimum spanning tree 
computations. 

I. Introduction 

The problem of multi-armed bandits (MAB) is a classic one 
in learning theory. In its simplest form, there are N arms, each 
providing stochastic rewards that are independent and identi- 
cally distributed over time, with unknown means. A policy is 
desired to pick one arm at each time sequentially, to maximize 
the reward. MAB problems capture a fundamental tradeoff 
between exploration and exploitation; on the one hand, various 
arms should be explored in order to learn their parameters, and 
on the other hand, the prior observations should be exploited 
to gain the best possible immediate rewards. MABs have 
been applied in a wide range of domains including Internet 
advertising [1], [2] and cognitive radio networks [3], [4]. 

As they are fundamentally about combinatorial optimization 
in unknown environments, one would indeed expect to find 
even broader use of multi-armed bandits. However, we argue 
that a barrier to their wider application in practice has been the 
limitation of the basic formulation and corresponding policies, 
which generally treat each arm as an independent entity. They 
are inadequate to deal with many combinatorial problems 
of practical interest in which there are large (exponential) 
numbers of arms. In such settings, it is important to consider 
and exploit any structure in terms of dependencies between the 
arms. We show in this work that when the dependencies take 
a linear form, they can be handled tractably with policies that 
have provably good performance in terms of regret as well as 
storage and computation. 

In this work, we formulate and consider the following 
general multi-armed bandit problem. There is a vector X 
of N random variables with unknown mean that are each 
instantiated in an i.i.d. fashion over time. There is a finite 
(possibly exponentially large) set of vector actions $a \in \mathcal{F}$ 
from which any action can be selected at each time. When 
action $a$ is performed, all elements of X that correspond to 
non-zero elements of $a$ are observed, and a linear reward $a^T X$ 
is obtained. This generalization captures a very broad class 
of combinatorial optimization problems with linear objectives 
and unknown random coefficients. 

A naive application of existing approaches for multi-armed 
bandits, such as the well-known UCB1 index policy of Auer 
et al. [5], for this problem would yield poor performance 
scaling in terms of regret, storage, and computation. This 
is because these approaches are focused on maintaining and 
computing quantities based on arm-specific observations and 
do not exploit potential dependencies between them. In this 
work, we instead propose smarter policies that explicitly take 
into account the linear form of the dependencies and base all 
storage and computations on the unknown variables directly, 
rather than the arms. As we shall show, this saves not only 
on storage and computation, but also substantially reduces the 
regret. 

Specifically, we first present a novel single-arm selection 
policy, Learning with Linear Rewards (LLR), that requires only 
$O(N)$ storage and yields a regret that grows essentially¹ 
as $O(N^4 \ln n)$, where n is the time index. We also discuss 
how this policy can be modified in a straightforward manner 
while maintaining the same performance guarantees when 
the problem is one of cost minimization rather than reward 
maximization. A key step in the policies we propose is 
solving a deterministic combinatorial optimization problem with a 
linear objective. While this is NP-hard in general (as it includes 
0-1 integer linear programming), there are still many special-case 
combinatorial problems of practical interest which can 
be solved in polynomial time. For such problems, the policy 
we propose would thus inherit the property of polynomial 
computation at each step. 

¹ This is a simplification of our key result in Section V, which gives a tighter 
expression for the bound on regret that applies uniformly over time, not just 
asymptotically. 

We also present in this paper a more general K-arm for- 
mulation, in which the policy is allowed to pick K > 1 
different actions at each time. We show how the single-arm 
policy can be readily extended to handle this and present the 
regret analysis for this case as well. 

Through several concrete examples, we show the applica- 
bility of our general formulation of multi-armed bandits with 
linear rewards to combinatorial network optimization. These 
include maximum weight matching in bipartite graphs (which 
is useful for user-channel allocations in cognitive radio net- 
works), as well as shortest path, and minimum spanning tree 
computation. The examples we present are far from exhausting 
the possible applications of the formulation and the policies we 
present in this work; there are many other linear-objective 
network optimization problems [6], [7]. Our framework, for 
the first time, allows these problems to be solved in stochastic 
settings with unknown random coefficients, with provably 
efficient performance. 

We expect that our work will also find practical application 
in other fields where such linear combinatorial optimization 
problems arise naturally, such as algorithmic economics, data 
mining, finance, operations research and industrial engineer- 
ing. 

This paper is organized as follows. We first provide a 
survey of related work in Section II. We then give a formal 
description of the multi-armed bandits with linear rewards 
problem we solve in Section III. In Section IV, we present our 
LLR policy and show that it requires only polynomial storage 
and polynomial computation per time period. We present the 
novel analysis of the regret of this policy in Section V and point 
out how this analysis generalizes known results on MAB. In 
Section VI, we discuss examples and applications of maximum 
weight matching, shortest path, and minimum spanning tree 
computations to show that our policy is widely useful for 
network applications that admit a tractable 
combinatorial optimization formulation with a linear objective 
function. Section VII shows the numerical simulation results. 
We show an extension of our policy for choosing the K largest 
values in Section VIII. Finally, we conclude with a summary 
of our contribution and point out avenues for future work in 
Section IX. 

II. Related Work 

Lai and Robbins [8] wrote one of the earliest papers on the 
classic non-Bayesian infinite horizon multi-armed bandit problem. 
Assuming K independent arms, each generating rewards 
that are i.i.d. over time from a given family of distributions 
with an unknown real-valued parameter, they presented a general 
policy that provides expected regret that is $O(K \log n)$, i.e., 
linear in the number of arms and asymptotically logarithmic 
in n. They also show that this policy is order optimal in 
that no policy can do better than $\Omega(K \log n)$. Anantharam 
et al. [9] extend this work to the case when M simultaneous 
plays are allowed. The work by Agrawal [10] presents 
easier-to-compute policies based on the sample mean that also 
have asymptotically logarithmic regret. However, these policies 
are not directly applicable to our problem formulation in 
this paper, which involves combinatorial arms that cannot be 
characterized by a single parameter. 

Our work is influenced by the paper of Auer et al. [5], 
which considers arms with non-negative rewards that are i.i.d. 
over time with an arbitrary un-parameterized distribution whose 
only restriction is that it has a finite support. Further, 
they provide a simple policy (referred to as UCB1), which 
achieves logarithmic regret uniformly over time, rather than 
only asymptotically. However, their work does not exploit 
potential dependencies between the arms. As we show in this 
paper, a direct application of their UCB1 policy therefore 
performs poorly for our problem formulation. 

Some recent works also propose decentralized 
policies for the multi-armed bandit problem. Liu and Zhao [4] 
and Anandkumar et al. [3] have both developed policies for 
the problem of M distributed players operating N independent 
arms. 

While the above key papers and many others have focused 
on independent arms, there have been some works treating 
dependencies between arms. The paper by Pandey et al. [1] 
divides arms into clusters of dependent arms (in our case 
there would be only one such cluster consisting of all the 
arms). Their model assumes that each arm provides only binary 
rewards, and in any case, they do not present any theoretical 
analysis of the expected regret. Ortner [11] proposes to use an 
additional arm color to utilize the given similarity information 
of different arms to improve the upper bound of the regret. 
That work assumes that the difference of the mean rewards of 
any two arms with the same color is less than a predefined 
parameter $\delta$, which is known to the user. This is different from 
the linear reward model in our paper. 

Mersereau et al. [12] consider a bandit problem where the 
expected reward is defined as a linear function of a random 
variable, and the prior distribution is known. They show the 
upper bound of the regret is $O(\sqrt{n})$ and the lower bound 
of the regret is $\Omega(\sqrt{n})$. Rusmevichientong and Tsitsiklis [13] 
extend [12] to the setting where the reward from each arm 
is modeled as the sum of a linear combination of a set of 
unknown static random numbers and a zero-mean random 
variable that is i.i.d. over time and independent across arms. 
The upper bound of the regret is shown to be $O(N\sqrt{n})$ on 
the unit sphere and $O(N\sqrt{n}\log^{3/2} n)$ for a compact set, and 
the lower bound of regret is $\Omega(N\sqrt{n})$ for both cases. The 
linear models in these works are different from our paper, in 
which the reward is expressed as a linear combination of a 
set of random processes. Also, [12] and [13] assume that only 
the reward is observed at each time. In our work, we assume 
that the random variables corresponding to non-zero action 
components are observed at each time (from which the reward 
can be inferred). 

Both [14] and [15] consider linear reward models that are 
more general than ours, but also under the assumption that 
only the reward is observed at each time. Auer [14] presents a 
randomized policy which requires storage and computation to 
grow linearly in the number of arms. This algorithm is shown 
to achieve a regret upper bound of $O(\sqrt{Nn}\,\log^{3/2}(n|\mathcal{F}|))$. 
Dani et al. [15] develop another randomized policy for the 
case of a compact set of arms, and show the regret is upper 
bounded by $O(N\sqrt{n}\log^{3/2} n)$ for sufficiently large n with 
high probability, and lower bounded by $\Omega(N\sqrt{n})$. They also 
show that when the difference in costs (denoted as $\Delta$) between 
the optimal and next to optimal decision among the extremal 
points is greater than zero, the regret is upper bounded by 
$O(\frac{N^2}{\Delta}\log^3 n)$ for sufficiently large n with high probability. To 
the best of our knowledge, ours is the first paper to consider linear 
rewards with observation of the random variables corresponding 
to non-zero action components. We present a deterministic 
policy, built around a deterministic combinatorial linear optimization 
problem, with a finite-time bound on regret which grows as $O(N^4 \log n)$, 
i.e., polynomially in the number of unknown random variables 
and strictly logarithmically in time. 

Our work in this paper is an extension of our recent work 
which introduced combinatorial multi-armed bandits [16]. The 
formulation in [16] has the restriction that the reward is 
generated from a matching in a bipartite graph of users and 
channels. Our work in this paper generalizes this to a broader 
formulation with linear reward, where the action vector is from 
a finite set. 

III. Problem Formulation 

Now we define the problem of multi-armed bandits with 
linear rewards that we solve in this paper. We consider a 
discrete time system with N unknown random processes 
$X_i(n)$, $1 \le i \le N$, where time is indexed by n. We assume 
that $X_i(n)$ evolves as an i.i.d. random process over time, with 
the only restriction that its distribution have a finite support. 
Without loss of generality, we normalize $X_i(n) \in [0, 1]$. We do 
not require that $X_i(n)$ be independent across i. This random 
process is assumed to have a mean $\theta_i = E[X_i]$ that is unknown 
to the users. We denote the set of all these means as $\Theta = \{\theta_i\}$. 

At each decision period n (also referred to interchangeably 
as a time slot), an N-dimensional action vector a(n), 
representing an arm, is selected under a policy $\pi(n)$ from 
a finite set $\mathcal{F}$. We assume $a_i(n) \ge 0$ for all $1 \le i \le N$. 
When a particular a(n) is selected, the value of $X_i(n)$ is observed 
only for those i with $a_i(n) \neq 0$. We denote by 
$\mathcal{A}_{a(n)} = \{i : a_i(n) \neq 0, 1 \le i \le N\}$ the index set of all 
$a_i(n) \neq 0$ for an arm a. We treat each $a(n) \in \mathcal{F}$ as an arm. 
The reward is defined as: 

$$R_{a(n)}(n) = \sum_{i=1}^{N} a_i(n) X_i(n).$$    (1) 

When a particular action/arm a(n) is selected, the random 
variables corresponding to non-zero components of a(n) are 
revealed², i.e., the value of $X_i(n)$ is observed for all i such 
that $a_i(n) \neq 0$. 
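
The observation model above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `play`, the sampler `sample`, and the Bernoulli means are hypothetical choices made for the example, and indices are 0-based rather than 1-based as in the text.

```python
import random

def play(action, sample_x):
    """Play one action vector; return the linear reward and the revealed X_i."""
    x = sample_x()  # one draw of the full vector X(n), hidden from the player
    # Only components with a non-zero coefficient are revealed.
    observed = {i: x[i] for i, a_i in enumerate(action) if a_i != 0}
    reward = sum(a_i * x[i] for i, a_i in enumerate(action))
    return reward, observed

# Hypothetical example: N = 3 Bernoulli variables with assumed means.
means = [0.2, 0.5, 0.9]
rng = random.Random(0)
sample = lambda: [1.0 if rng.random() < p else 0.0 for p in means]

reward, observed = play([1, 0, 2], sample)
```

Here the player learns the values of the first and third variables but nothing about the second, since its coefficient is zero.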

We evaluate policies with respect to regret, which is defined 
as the difference between the expected reward that could be 
obtained by a genie that can pick an optimal arm at each time, 
and that obtained by the given policy. Note that minimizing 
the regret is equivalent to maximizing the rewards. Regret can 
be expressed as: 

$$\mathfrak{R}^{\pi}_n(\Theta) = n\theta^* - E^{\pi}\left[\sum_{t=1}^{n} R_{\pi(t)}(t)\right]$$    (2) 

where $\theta^* = \max_{a \in \mathcal{F}} \sum_{i=1}^{N} a_i \theta_i$ is the expected reward of an optimal 
arm. For the rest of the paper, we use * as the index indicating 
that a parameter is for an optimal arm. If more than 
one optimal arm exists, * refers to any one of them. 

Intuitively, we would like the regret $\mathfrak{R}^{\pi}_n(\Theta)$ to be as small 
as possible. If it is sub-linear with respect to time n, the 
time-averaged regret will tend to zero and the maximum possible 
time-averaged reward can be achieved. Note that the number 
of arms $|\mathcal{F}|$ can be exponential in the number of unknown 
random variables N. 
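
As a small worked illustration of the regret definition, the sketch below computes the gap between the genie's expected reward and the expected reward of a fixed sequence of plays. It is a simplification under stated assumptions: the helper names, the three actions, and the means are hypothetical, the optimal arm is found by brute force, and the expectation over the policy's own randomness is not taken.

```python
def expected_reward(action, means):
    """Expected linear reward of an action, i.e. sum_i a_i * theta_i."""
    return sum(a_i * theta_i for a_i, theta_i in zip(action, means))

def pseudo_regret(played, actions, means):
    """n * theta_star minus the summed expected rewards of the played arms."""
    theta_star = max(expected_reward(a, means) for a in actions)
    return sum(theta_star - expected_reward(a, means) for a in played)

# Hypothetical instance: three actions over N = 3 variables.
actions = [(1, 1, 0), (0, 1, 1), (1, 0, 1)]
means = [0.2, 0.5, 0.9]
played = [actions[0], actions[1], actions[2], actions[1]]
regret = pseudo_regret(played, actions, means)  # 0.7 + 0 + 0.3 + 0
```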

IV. Policy Design 

A. A Naive Approach 

A straightforward, relatively naive approach to solving the 
multi-armed bandits with linear rewards problem that we defined 
is to use the UCB1 policy given by Auer et al. [5]. For UCB1, 
the arm that maximizes $\bar{Y}_k + \sqrt{\frac{2 \ln n}{m_k}}$ will be selected at each 
time slot, where $\bar{Y}_k$ is the mean observed reward on arm k, 
and $m_k$ is the number of times that arm k has been played. 
This approach essentially ignores the dependencies across the 
different arms, storing observed information about each arm 
independently, and making decisions based on this information 
alone. 
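
For reference, the UCB1 selection rule just described can be sketched as follows. The function names are our own, and the sketch assumes every arm has already been played at least once so that the counts are positive. Note that the two lists have one entry per arm, which is what makes the approach unattractive when the number of arms is exponential.

```python
import math

def ucb1_index(mean_reward, plays, n):
    """UCB1 index: sample mean plus the confidence term sqrt(2 ln n / m_k)."""
    return mean_reward + math.sqrt(2.0 * math.log(n) / plays)

def ucb1_select(mean_rewards, plays, n):
    """Return the index k of the arm with the largest UCB1 index at time n."""
    return max(range(len(plays)),
               key=lambda k: ucb1_index(mean_rewards[k], plays[k], n))
```

With equal sample means, the rarely played arm wins: `ucb1_select([0.5, 0.5], [10, 1], 11)` picks arm 1.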

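Auer et al. [5] showed the following upper bound on the 
regret of this policy: 

Theorem 1: The expected regret under the UCB1 policy is at 
most 

$$\left[ 8 \sum_{k : \theta_k < \theta^*} \frac{\ln n}{\Delta_k} \right] + \left( 1 + \frac{\pi^2}{3} \right) \left( \sum_{k : \theta_k < \theta^*} \Delta_k \right)$$    (3) 

where $\Delta_k = \theta^* - \theta_k$ and $\theta_k = \sum_i a_i \theta_i$ for the action vector $a$ of arm $k$. 
Proof: See [5, Theorem 1]. 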



Note that UCB1 requires storage that is linear in the number 
of arms and yields regret growing linearly with the number of 
arms. In a case where the number of arms grows exponentially 
with the number of unknown variables, both of these are highly 
unsatisfactory. 

² As noted in the related work, this is a key assumption in our work that 
differentiates it from other prior work on linear dependent-arm bandits [14], 
[15]. This is a very reasonable assumption in many cases; for instance, in the 
combinatorial network optimization applications we discuss in Section VI, it 
corresponds to revealing weights on the set of edges selected at each time. 

Intuitively, the UCB1 algorithm performs poorly on this problem 
because it ignores the underlying dependencies. This 
motivates us to propose a more sophisticated policy which 
stores observations from correlated arms more efficiently and exploits 
the correlations to make better decisions. 

B. A new policy 

Our proposed policy, which we refer to as "learning with 
linear rewards" (LLR), is shown in Algorithm 1. 

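Algorithm 1 Learning with Linear Rewards (LLR) 
// Initialization 
If $\max_a |\mathcal{A}_a|$ is known, let $L = \max_a |\mathcal{A}_a|$; else, $L = N$; 
for p = 1 to N do 
    n = p; 
    Play any arm a such that $p \in \mathcal{A}_a$; 
    Update $(\hat{\theta}_i)_{1 \times N}$, $(m_i)_{1 \times N}$ accordingly; 
end for 
// Main loop 
while 1 do 
    n = n + 1; 
    Play an arm a which solves the maximization problem 

    $$a = \arg\max_{a \in \mathcal{F}} \sum_{i \in \mathcal{A}_a} a_i \left( \hat{\theta}_i + \sqrt{\frac{(L+1)\ln n}{m_i}} \right);$$    (4) 

    Update $(\hat{\theta}_i)_{1 \times N}$, $(m_i)_{1 \times N}$ accordingly; 
end while 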

Table I summarizes some notation we use in the description 
and analysis of our algorithm. 

The key idea behind this algorithm is to store and use 
observations for each random variable, rather than for each 
arm as a whole. Since the same random variable can be ob- 
served while operating different arms, this allows exploitation 
of information gained from the operation of one arm to make 
decisions about a dependent arm. 

We use two 1 by N vectors to store the information after 
we play an arm at each time slot. One is $(\hat{\theta}_i)_{1 \times N}$, in which 
$\hat{\theta}_i$ is the average (sample mean) of all the observed values of 
$X_i$ up to the current time slot (obtained through potentially 
different sets of arms over time). The other one is $(m_i)_{1 \times N}$, in 
which $m_i$ is the number of times that $X_i$ has been observed 
up to the current time slot. 

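At each time slot n, after an arm a(n) is played, we get 
the observation of $X_i(n)$ for all $i \in \mathcal{A}_{a(n)}$. Then $(\hat{\theta}_i)_{1 \times N}$ 
and $(m_i)_{1 \times N}$ (both initialized to 0 at time 0) are updated as 
follows: 

$$\hat{\theta}_i(n) = \begin{cases} \dfrac{\hat{\theta}_i(n-1)\, m_i(n-1) + X_i(n)}{m_i(n-1) + 1}, & \text{if } i \in \mathcal{A}_{a(n)} \\ \hat{\theta}_i(n-1), & \text{else} \end{cases}$$    (5) 

$$m_i(n) = \begin{cases} m_i(n-1) + 1, & \text{if } i \in \mathcal{A}_{a(n)} \\ m_i(n-1), & \text{else} \end{cases}$$    (6) 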
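
The two update rules amount to a running mean and a counter per random variable. A minimal sketch, assuming the observations arrive as a dictionary from (0-based) index to observed value; the name `llr_update` is ours:

```python
def llr_update(theta_hat, m, observed):
    """Apply updates (5) and (6) in place for the observed components only."""
    for i, x in observed.items():
        # Running mean: fold the new value into the previous average.
        theta_hat[i] = (theta_hat[i] * m[i] + x) / (m[i] + 1)
        m[i] += 1

theta_hat, m = [0.0, 0.0, 0.0], [0, 0, 0]
llr_update(theta_hat, m, {0: 1.0, 2: 0.5})
llr_update(theta_hat, m, {0: 0.0})
# theta_hat is now [0.5, 0.0, 0.5] and m is [2, 0, 1]
```

Unobserved components are simply left untouched, which is the "else" branch of both rules.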



TABLE I 
Notation 

$N$:                      number of random variables. 
$a$:                      vectors of coefficients, defined on set $\mathcal{F}$; 
                          we map each $a$ as an arm. 
$\mathcal{A}_a$:          $\{i : a_i \neq 0, 1 \le i \le N\}$. 
$*$:                      index indicating that a parameter is for an 
                          optimal arm. 
$m_i$:                    number of times that $X_i$ has been observed 
                          up to the current time slot. 
$\hat{\theta}_i$:         average (sample mean) of all the observed 
                          values of $X_i$ up to the current time slot. 
                          Note that $E[\hat{\theta}_i(n)] = \theta_i$. 
$\hat{\theta}_{i,m_i}$:   average (sample mean) of all the observed 
                          values of $X_i$ when it is observed $m_i$ times. 
$\Delta_a$:               $R^* - R_a$. 
$\Delta_{\min}$:          $\min_{a \neq a^*} \Delta_a$. 
$\Delta_{\max}$:          $\max_{a \neq a^*} \Delta_a$. 
$T_a(n)$:                 number of times arm $a$ has been played 
                          in the first n time slots. 
$a_{\max}$:               $\max_{a \in \mathcal{F}} \max_i a_i$. 



Note that while we indicate the time index in the above 
updates for notational clarity, it is not necessary to store the 
matrices from previous time steps while running the algorithm. 

The LLR policy requires storage linear in N. In Section V, we 
will present the analysis of the upper bound of regret, and show 
that it is polynomial in N and logarithmic in time. Note that 
the maximization problem (4) needs to be solved as part of the 
LLR policy. It is a deterministic linear optimization problem with 
a feasible set $\mathcal{F}$, and the computation time for an arbitrary $\mathcal{F}$ 
may not be polynomial in N. As we show in Section VI, 
there exist many practically useful examples with polynomial 
computation time. 
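
To make the policy concrete, here is a minimal Python sketch of the LLR loop. It is an illustration under several assumptions that are ours rather than the paper's: the action set is small enough to enumerate, so the maximization (4) is solved by brute force instead of a problem-specific polynomial-time solver; every variable index appears in at least one action; indices are 0-based; and the names `llr_index` and `run_llr` are hypothetical.

```python
import math

def llr_index(action, theta_hat, m, n, L):
    """Objective of (4): sum of a_i * (theta_hat_i + sqrt((L+1) ln n / m_i))."""
    return sum(a_i * (theta_hat[i] + math.sqrt((L + 1) * math.log(n) / m[i]))
               for i, a_i in enumerate(action) if a_i != 0)

def run_llr(actions, sample_x, horizon):
    """Run LLR for `horizon` slots; return the plays and the final statistics."""
    N = len(actions[0])
    L = max(sum(1 for a_i in a if a_i != 0) for a in actions)
    theta_hat, m, history = [0.0] * N, [0] * N, []

    def play(action):
        x = sample_x()
        for i, a_i in enumerate(action):
            if a_i != 0:  # only non-zero components are observed
                theta_hat[i] = (theta_hat[i] * m[i] + x[i]) / (m[i] + 1)
                m[i] += 1
        history.append(action)

    # Initialization: observe every variable at least once.
    for p in range(N):
        play(next(a for a in actions if a[p] != 0))
    # Main loop: play the arm that maximizes the LLR index.
    for n in range(N + 1, horizon + 1):
        play(max(actions, key=lambda a: llr_index(a, theta_hat, m, n, L)))
    return history, theta_hat, m

# Hypothetical instance with a degenerate (constant) sampler for reproducibility.
actions = [(1, 1, 0), (0, 1, 1), (1, 0, 1)]
history, theta_hat, m = run_llr(actions, lambda: [0.2, 0.5, 0.9], 200)
```

Only the two length-N lists are stored, regardless of how many actions there are; the cost of each step is dominated by the maximization, which is why the structure of the action set matters.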

V. Analysis of Regret 

Traditionally, the regret of a policy for a multi-armed bandit 
problem is upper-bounded by analyzing the expected number 
of times that each non-optimal arm is played, and then summing 
this expectation over all non-optimal arms. While such an 
approach will work to analyze the LLR policy too, it turns 
out that the upper bound for regret consequently obtained is 
quite loose, being linear in the number of arms, which may 
grow faster than polynomials. Instead, we give here a tighter 
analysis of the LLR policy that provides an upper bound which 
is instead polynomial in N and logarithmic in time. Like the 
regret analysis in [5], this upper bound is valid for finite n. 

Theorem 2: The expected regret under the LLR policy is 
at most 

$$\left[ \frac{4 a_{\max}^2 L^2 (L+1) N \ln n}{(\Delta_{\min})^2} + N + \frac{\pi^2}{3} L N \right] \Delta_{\max}.$$    (7) 
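
To get a feel for how the bound in Theorem 2 scales, it can be evaluated numerically. The helper below is a direct transcription of (7); the function and argument names are our own, and the parameter values in the usage line are arbitrary illustrative numbers.

```python
import math

def llr_regret_bound(n, N, L, a_max, delta_min, delta_max):
    """Evaluate the Theorem 2 upper bound on expected regret at time n."""
    log_term = 4 * a_max**2 * L**2 * (L + 1) * N * math.log(n) / delta_min**2
    constant = N + (math.pi**2 / 3) * L * N
    return (log_term + constant) * delta_max

bound_small = llr_regret_bound(100, N=3, L=2, a_max=1, delta_min=0.3, delta_max=0.7)
bound_large = llr_regret_bound(100**2, N=3, L=2, a_max=1, delta_min=0.3, delta_max=0.7)
```

Squaring the horizon only doubles the logarithmic term, while the bound depends on the action set only through N, L, the gaps, and the largest coefficient, not through the number of arms.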
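To prove Theorem 2, we use the inequalities stated in the 
Chernoff-Hoeffding bound [17]. 

Lemma 1 (Chernoff-Hoeffding bound [17]): 
$X_1, \ldots, X_n$ are random variables with range $[0, 1]$, and 
$E[X_t | X_1, \ldots, X_{t-1}] = \mu$, $\forall 1 \le t \le n$. Denote $S_n = \sum_{t=1}^{n} X_t$. 
Then for all $a \ge 0$ 

$$\Pr\{S_n \ge n\mu + a\} \le e^{-2a^2/n}, \qquad \Pr\{S_n \le n\mu - a\} \le e^{-2a^2/n}.$$    (8) 

Proof of Theorem 2: Denote $C_{t,m_i}$ as $\sqrt{\frac{(L+1)\ln t}{m_i}}$. We 
introduce $\tilde{T}_i(n)$ as a counter after the initialization period. It 
is updated in the following way: 

At each time slot after the initialization period, one of the 
two cases must happen: (1) an optimal arm is played; (2) 
a non-optimal arm is played. In the first case, $(\tilde{T}_i(n))_{1 \times N}$ 
won't be updated. When a non-optimal arm a(n) is picked 
at time n, there must be at least one $i \in \mathcal{A}_{a(n)}$ such that 
$i = \arg\min_{j \in \mathcal{A}_{a(n)}} m_j$. If there is only one such index, $\tilde{T}_i(n)$ is increased 
by 1. If there are multiple such indices, we arbitrarily pick one, 
say $i'$, and increment $\tilde{T}_{i'}$ by 1. 

Each time a non-optimal arm is picked, exactly one 
element in $(\tilde{T}_i(n))_{1 \times N}$ is incremented by 1. This implies that 
the total number of times that we have played the non-optimal arms 
is equal to the summation of all counters in $(\tilde{T}_i(n))_{1 \times N}$. 
Therefore, we have: 

$$\sum_{a : a \neq a^*} E[T_a(n)] = \sum_{i=1}^{N} E[\tilde{T}_i(n)].$$    (9) 

Also note that for $\tilde{T}_i(n)$, the following inequality holds: 

$$\tilde{T}_i(n) \le m_i(n), \quad \forall 1 \le i \le N.$$    (10) 

Denote by $\tilde{I}_i(n)$ the indicator function which is equal to 
1 if $\tilde{T}_i(n)$ is incremented by one at time n. Let $l$ be an arbitrary 
positive integer. Then: 

$$\tilde{T}_i(n) = \sum_{t=N+1}^{n} 1\{\tilde{I}_i(t) = 1\} \le l + \sum_{t=N+1}^{n} 1\{\tilde{I}_i(t) = 1, \tilde{T}_i(t-1) \ge l\}$$    (11) 

where $1(x)$ is the indicator function defined to be 1 when 
the predicate x is true, and 0 when it is false. When $\tilde{I}_i(t) = 1$, 
a non-optimal arm a(t) has been picked for which 
$m_i = \min_j \{m_j : \forall j \in \mathcal{A}_{a(t)}\}$. We denote this arm as a(t) since at 
each time that $\tilde{I}_i(t) = 1$, we could get different arms. Then, 

$$\tilde{T}_i(n) \le l + \sum_{t=N+1}^{n} 1\Big\{ \sum_{j \in \mathcal{A}_{a^*}} a^*_j \big(\hat{\theta}_{j,m_j(t-1)} + C_{t-1,m_j(t-1)}\big) \le \sum_{j \in \mathcal{A}_{a(t)}} a_j(t) \big(\hat{\theta}_{j,m_j(t-1)} + C_{t-1,m_j(t-1)}\big),\ \tilde{T}_i(t-1) \ge l \Big\}$$ 
$$\le l + \sum_{t=N}^{n} 1\Big\{ \sum_{j \in \mathcal{A}_{a^*}} a^*_j \big(\hat{\theta}_{j,m_j(t)} + C_{t,m_j(t)}\big) \le \sum_{j \in \mathcal{A}_{a(t)}} a_j(t) \big(\hat{\theta}_{j,m_j(t)} + C_{t,m_j(t)}\big),\ \tilde{T}_i(t) \ge l \Big\}.$$    (12) 

Note that $l \le \tilde{T}_i(t)$ implies 

$$l \le \tilde{T}_i(t) \le m_j(t), \quad \forall j \in \mathcal{A}_{a(t)}.$$    (13) 

So, 

$$\tilde{T}_i(n) \le l + \sum_{t=N}^{n} 1\Big\{ \min_{0 < m_{h_1}, \ldots, m_{h_{|\mathcal{A}_{a^*}|}} \le t} \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} \big(\hat{\theta}_{h_j,m_{h_j}} + C_{t,m_{h_j}}\big) \le \max_{l \le m_{p_1}, \ldots, m_{p_{|\mathcal{A}_{a(t)}|}} \le t} \sum_{j=1}^{|\mathcal{A}_{a(t)}|} a_{p_j}(t) \big(\hat{\theta}_{p_j,m_{p_j}} + C_{t,m_{p_j}}\big) \Big\}$$ 
$$\le l + \sum_{t=1}^{\infty} \sum_{m_{h_1}=1}^{t} \cdots \sum_{m_{h_{|\mathcal{A}_{a^*}|}}=1}^{t} \sum_{m_{p_1}=l}^{t} \cdots \sum_{m_{p_{|\mathcal{A}_{a(t)}|}}=l}^{t} 1\Big\{ \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} \big(\hat{\theta}_{h_j,m_{h_j}} + C_{t,m_{h_j}}\big) \le \sum_{j=1}^{|\mathcal{A}_{a(t)}|} a_{p_j}(t) \big(\hat{\theta}_{p_j,m_{p_j}} + C_{t,m_{p_j}}\big) \Big\}$$    (14) 

where $h_j$ ($1 \le j \le |\mathcal{A}_{a^*}|$) represents the j-th element in $\mathcal{A}_{a^*}$ 
and $p_j$ ($1 \le j \le |\mathcal{A}_{a(t)}|$) represents the j-th element in $\mathcal{A}_{a(t)}$. 

The event $\sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} (\hat{\theta}_{h_j,m_{h_j}} + C_{t,m_{h_j}}) \le \sum_{j=1}^{|\mathcal{A}_{a(t)}|} a_{p_j}(t) (\hat{\theta}_{p_j,m_{p_j}} + C_{t,m_{p_j}})$ 
means that at least one of the following must be true: 

$$\sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} \hat{\theta}_{h_j,m_{h_j}} \le R^* - \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} C_{t,m_{h_j}},$$    (15) 

$$\sum_{j=1}^{|\mathcal{A}_{a(t)}|} a_{p_j}(t) \hat{\theta}_{p_j,m_{p_j}} \ge R_{a(t)} + \sum_{j=1}^{|\mathcal{A}_{a(t)}|} a_{p_j}(t) C_{t,m_{p_j}},$$    (16) 

$$R^* < R_{a(t)} + 2 \sum_{j=1}^{|\mathcal{A}_{a(t)}|} a_{p_j}(t) C_{t,m_{p_j}}.$$    (17) 

Now we find the upper bound for 
$\Pr\{\sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} \hat{\theta}_{h_j,m_{h_j}} \le R^* - \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} C_{t,m_{h_j}}\}$. 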
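We have: 

$$\Pr\Big\{ \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} \hat{\theta}_{h_j,m_{h_j}} \le R^* - \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} C_{t,m_{h_j}} \Big\}$$ 
$$= \Pr\Big\{ \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} \hat{\theta}_{h_j,m_{h_j}} \le \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} \theta_{h_j} - \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} C_{t,m_{h_j}} \Big\}$$ 
$$\le \Pr\{\text{at least one of the following must hold for some } 1 \le j \le |\mathcal{A}_{a^*}|:\ a^*_{h_j} \hat{\theta}_{h_j,m_{h_j}} \le a^*_{h_j} \theta_{h_j} - a^*_{h_j} C_{t,m_{h_j}} \}$$ 
$$\le \sum_{j=1}^{|\mathcal{A}_{a^*}|} \Pr\{ a^*_{h_j} \hat{\theta}_{h_j,m_{h_j}} \le a^*_{h_j} \theta_{h_j} - a^*_{h_j} C_{t,m_{h_j}} \}$$ 
$$= \sum_{j=1}^{|\mathcal{A}_{a^*}|} \Pr\{ \hat{\theta}_{h_j,m_{h_j}} \le \theta_{h_j} - C_{t,m_{h_j}} \}.$$ 

$\forall 1 \le j \le |\mathcal{A}_{a^*}|$, applying the Chernoff-Hoeffding bound 
stated in Lemma 1, we can find the upper bound of each 
item in the above equation as 

$$\Pr\{ \hat{\theta}_{h_j,m_{h_j}} \le \theta_{h_j} - C_{t,m_{h_j}} \}$$ 
$$= \Pr\{ m_{h_j} \hat{\theta}_{h_j,m_{h_j}} \le m_{h_j} \theta_{h_j} - m_{h_j} C_{t,m_{h_j}} \}$$ 
$$\le e^{-2 \cdot \frac{1}{m_{h_j}} \cdot (m_{h_j})^2 \cdot \frac{(L+1)\ln t}{m_{h_j}}}$$ 
$$= e^{-2(L+1)\ln t}$$ 
$$= t^{-2(L+1)}.$$ 

Thus, 

$$\Pr\Big\{ \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} \hat{\theta}_{h_j,m_{h_j}} \le R^* - \sum_{j=1}^{|\mathcal{A}_{a^*}|} a^*_{h_j} C_{t,m_{h_j}} \Big\} \le |\mathcal{A}_{a^*}|\, t^{-2(L+1)} \le L\, t^{-2(L+1)}.$$    (18) 

Similarly, we can get the upper bound of the probability for 
inequality (16): 

$$\Pr\Big\{ \sum_{j=1}^{|\mathcal{A}_{a(t)}|} a_{p_j}(t) \hat{\theta}_{p_j,m_{p_j}} \ge R_{a(t)} + \sum_{j=1}^{|\mathcal{A}_{a(t)}|} a_{p_j}(t) C_{t,m_{p_j}} \Big\} \le L\, t^{-2(L+1)}.$$    (19) 

Note that for $l \ge \left\lceil \frac{4(L+1)\ln n}{\left( \Delta_{a(t)} / (L a_{\max}) \right)^2} \right\rceil$, 

$$R^* - R_{a(t)} - 2 \sum_{j=1}^{|\mathcal{A}_{a(t)}|} a_{p_j}(t) C_{t,m_{p_j}}$$ 
$$\ge R^* - R_{a(t)} - 2 \sum_{j=1}^{|\mathcal{A}_{a(t)}|} a_{p_j}(t) \sqrt{\frac{(L+1)\ln t}{m_{p_j}}}$$ 
$$\ge R^* - R_{a(t)} - L a_{\max} \sqrt{\frac{4(L+1)\ln n}{l}}$$ 
$$\ge R^* - R_{a(t)} - L a_{\max} \sqrt{\frac{4(L+1)\ln n}{4(L+1)\ln n} \left( \frac{\Delta_{a(t)}}{L a_{\max}} \right)^2}$$ 
$$\ge R^* - R_{a(t)} - \Delta_{a(t)} = 0.$$    (20) 

Equation (20) implies that condition (17) is false when 
$l = \left\lceil \frac{4(L+1)\ln n}{( \Delta_{a(t)} / (L a_{\max}) )^2} \right\rceil$. 
If we let $l = \left\lceil \frac{4(L+1)\ln n}{( \Delta_{\min} / (L a_{\max}) )^2} \right\rceil$, then (17) is false 
for all a(t). 
Therefore, 

$$E[\tilde{T}_i(n)] \le \left\lceil \frac{4(L+1)\ln n}{\left( \Delta_{\min} / (L a_{\max}) \right)^2} \right\rceil + \sum_{t=1}^{\infty} \left( \sum_{m_{h_1}=1}^{t} \cdots \sum_{m_{h_{|\mathcal{A}_{a^*}|}}=1}^{t} \sum_{m_{p_1}=l}^{t} \cdots \sum_{m_{p_{|\mathcal{A}_{a(t)}|}}=l}^{t} 2L\, t^{-2(L+1)} \right)$$ 
$$\le \frac{4 a_{\max}^2 L^2 (L+1) \ln n}{(\Delta_{\min})^2} + 1 + L \sum_{t=1}^{\infty} 2 t^{-2}$$ 
$$\le \frac{4 a_{\max}^2 L^2 (L+1) \ln n}{(\Delta_{\min})^2} + 1 + \frac{\pi^2}{3} L.$$    (21) 

So under the LLR policy, we have: 

$$\mathfrak{R}^{\pi}_n(\Theta) = R^* n - E^{\pi}\left[ \sum_{t=1}^{n} R_{\pi(t)}(t) \right]$$ 
$$= \sum_{a : R_a < R^*} \Delta_a E[T_a(n)]$$ 
$$\le \Delta_{\max} \sum_{a : R_a < R^*} E[T_a(n)]$$ 
$$= \Delta_{\max} \sum_{i=1}^{N} E[\tilde{T}_i(n)]$$ 
$$\le \left[ \sum_{i=1}^{N} \frac{4 a_{\max}^2 L^2 (L+1) \ln n}{(\Delta_{\min})^2} + N + \frac{\pi^2}{3} L N \right] \Delta_{\max}$$ 
$$\le \left[ \frac{4 a_{\max}^2 L^2 (L+1) N \ln n}{(\Delta_{\min})^2} + N + \frac{\pi^2}{3} L N \right] \Delta_{\max}.$$    (22) 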
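
The constant term in the per-counter bound comes from a simple counting step, which can be spelled out as follows. This is our reading of the argument, under the assumption that there are at most $2L$ nested inner sums, each ranging over at most $t$ values, and that each summand is bounded by $2L\,t^{-2(L+1)}$:

```latex
% At most t^{|A_{a*}|} * t^{|A_{a(t)}|} <= t^{2L} terms, each at most 2L t^{-2(L+1)}.
\[
\sum_{t=1}^{\infty} t^{2L} \cdot 2L\, t^{-2(L+1)}
\;=\; 2L \sum_{t=1}^{\infty} t^{-2}
\;=\; 2L \cdot \frac{\pi^2}{6}
\;=\; \frac{\pi^2}{3}\, L .
\]
```

The exponent $2(L+1)$ in the confidence term is what leaves a summable $t^{-2}$ after the $t^{2L}$ terms are counted, which explains the choice of $(L+1)\ln n$ in the index.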



Remark 1: Note that when the set of action vectors consists 
of binary vectors with a single "1", the problem formulation 
reduces to a multi-armed bandit problem with N independent 
arms. In this special case, the LLR algorithm is equivalent to 
UCB1 in [5]. Thus, our results generalize that prior work. 
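
The reduction can be checked directly on the selection rule. The short derivation below specializes the LLR index to this case; $e_i$ denotes the $i$-th unit vector, which is our notation rather than the paper's:

```latex
% Special case: F = {e_1, ..., e_N}, so |A_a| = 1 for every arm and L = 1.
\[
\sum_{j \in \mathcal{A}_{e_i}} a_j \left( \hat{\theta}_j + \sqrt{\frac{(L+1)\ln n}{m_j}} \right)
\;=\; \hat{\theta}_i + \sqrt{\frac{2 \ln n}{m_i}} ,
\]
% which is the UCB1 index of arm i, with m_i equal to the number of plays of arm i.
```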

Remark 2: We have presented $\mathcal{F}$ as a finite set in our 
problem formulation. We note that the LLR policy we have 
described and its analysis also work with a more general 
formulation in which $\mathcal{F}$ is an infinite set with the following 
additional constraints: the maximization problem in (4) always 
has at least one solution; $\Delta_{\min}$ exists; $a^*$ is bounded. With the 
above constraints, Algorithm 1 will work the same and the 
conclusion and all the details of the proof of Theorem 2 
remain the same. 

Remark 3: Theorem 2 also holds for random variables 
$X_i$, $1 \le i \le N$, that are not i.i.d. over time, but satisfy the 
weaker assumption that $E[X_i(t) | X_i(1), \ldots, X_i(t-1)] = \theta_i$, 
$\forall 1 \le i \le N$. This is because the Chernoff-Hoeffding bound 
in Lemma 1 only needs this weaker assumption. 

VI. Applications 

We now describe some applications and extensions of the 
LLR policy for combinatorial network optimization in graphs 
where the edge weights are unknown random variables. 

A. Maximum Weighted Matching 

Maximum Weighted Matching (MWM) problems are 
widely used in many optimization problems in wireless 
networks, such as the prior work in [18], [19]. Given any graph 
G = (V, E), there is a weight associated with each edge and 
the objective is to maximize the sum weight of a matching 
among all the matchings in a given constraint set, i.e., the 
general formulation for the MWM problem is 

$$\max\ R_a^{MWM} = \sum_{i=1}^{|E|} a_i W_i$$ 
$$\text{s.t. } a \text{ is a matching}$$    (23) 

where $W_i$ is the weight associated with each edge i. 

In many practical applications, the weights are unknown 
random variables and we need to learn by selecting different 
matchings over time. This kind of problem fits the general 
framework of our proposed policy, regarding the reward as 
the sum weight and a matching as an arm. Our proposed 
LLR policy is a solution with linear storage, and with regret 
polynomial in the number of edges and logarithmic in time. 

There are various algorithms to solve the different 
variations of the maximum weighted matching problem, such 
as the Hungarian algorithm for maximum weighted bipartite 
matching [20] and Edmonds's matching algorithm [21] for 
general maximum matching. In these cases, the computation 
time is also polynomial. 

Here we present a general problem of multiuser channel 
allocations in cognitive radio network. There are M secondary 
users and Q orthogonal channels. Each secondary user requires 



a single channel for operation that does not conflict with 
the channels assigned to the other users. Due to geographic 
dispersion, each secondary user can potentially see different 
primary user occupancy behavior on each channel. Time is divided 
into discrete decision rounds. The throughput obtainable 
from spectrum opportunities on each user-channel combination 
over a decision period is denoted as $S_{i,j}$ and modeled as an 
arbitrarily-distributed random variable with bounded support 
but unknown mean, i.i.d. over time. This random process is 
assumed to have a mean $\theta_{i,j}$ that is unknown to the users. 
The objective is to search for an allocation of channels for all 
users that maximizes the expected sum throughput. 

Assuming an interference model whereby at most one 
secondary user can derive benefit from any channel, if the 
number of channels is greater than the number of users, an 
optimal channel allocation employs a one-to-one matching of 
users to channels, such that the expected sum-throughput is 
maximized. 

Figure 1 illustrates a simple scenario. There are two 
secondary users (i.e., links) S1 and S2, that are each assumed to be 
in interference range of each other. S1 is proximate to primary 
user P1 who is operating on channel 1. S2 is proximate 
to primary user P2 who is operating on channel 2. The 
matrix shows the corresponding $\Theta$, i.e., the throughput each 
secondary user could derive from being on the corresponding 
channel. In this simple example, the optimal matching is for 
secondary user 1 to be allocated channel 2 and user 2 to be 
allocated channel 1. Note, however, that, in our formulation, 
the users are not a priori aware of the matrix of mean values, 
and therefore must follow a sequential learning policy. 

        C1     C2 
S1      0.3    0.8 
S2      0.9    0.2 

Fig. 1. An illustrative scenario 
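
The claim about the optimal matching in this example is easy to verify by enumeration. The sketch below is a brute-force check, not the polynomial-time algorithm one would use in practice; it assumes the matrix rows are users S1, S2 and the columns are channels C1, C2, uses 0-based indices, and requires that there be at least as many channels as users. The name `best_matching` is ours.

```python
from itertools import permutations

def best_matching(theta):
    """Return (channels, value): channels[i] is the channel assigned to user i."""
    M, Q = len(theta), len(theta[0])
    value = lambda chans: sum(theta[i][c] for i, c in enumerate(chans))
    best = max(permutations(range(Q), M), key=value)
    return best, value(best)

theta = [[0.3, 0.8],   # S1 on C1, C2
         [0.9, 0.2]]   # S2 on C1, C2
assignment, total = best_matching(theta)
```

Here `assignment` is `(1, 0)`, i.e. S1 takes channel 2 and S2 takes channel 1, with expected sum throughput of about 1.7 against 0.5 for the other matching.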



Note that this problem can be formulated as a multi-armed 
bandit with linear rewards, in which each arm corresponds 
to a matching of the users to channels, and the reward 
corresponds to the sum-throughput. In this channel allocation 
problem, there are M x Q unknown random variables, and the 
number of arms is P(Q, M), which can grow exponentially 
in the number of unknown random variables. Following 
convention, instead of denoting the variables as a vector, we 
refer to them as an M by Q matrix. So the reward at each time slot 
by choosing a permutation a is expressed as: 



$$R_{\mathbf{a}} = \sum_{i=1}^{M} \sum_{j=1}^{Q} a_{i,j} S_{i,j} \qquad (24)$$



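where $\mathbf{a} \in \mathcal{F}$, and $\mathcal{F}$ is the set of all permutations, which is defined
as:

$$\mathcal{F} = \Big\{ \mathbf{a} : a_{i,j} \in \{0, 1\},\ \forall i, j \ \wedge\ \sum_{i=1}^{Q} a_{i,j} = 1 \ \wedge\ \sum_{j=1}^{Q} a_{i,j} = 1 \Big\}. \qquad (25)$$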

We use two $M$ by $Q$ matrices to store the information after
we play an arm at each time slot. One is $(\hat{\theta}_{i,j})_{M \times Q}$, in which
$\hat{\theta}_{i,j}$ is the average (sample mean) of all the observed values
of channel $j$ by user $i$ up to the current time slot (obtained
through potentially different sets of arms over time). The other
is $(m_{i,j})_{M \times Q}$, in which $m_{i,j}$ is the number of times that
channel $j$ has been observed by user $i$ up to the current time
slot.
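The bookkeeping for these two matrices can be sketched as follows. This is an illustrative sketch only: the function name `update_estimates`, the list-of-lists layout, and the example sizes are assumptions, and the sample mean is maintained in its incremental form so that no past observations need to be stored.

```python
def update_estimates(theta_hat, m, assignment, observations):
    """Update sample means and counts in place after one time slot.

    theta_hat[i][j]: running sample mean for user i on channel j.
    m[i][j]: number of times the pair (i, j) has been observed.
    assignment[i]: channel played by user i in this slot.
    observations[i]: throughput observed by user i in this slot.
    """
    for i, j in enumerate(assignment):
        m[i][j] += 1
        # Incremental form of the sample mean: no past samples are stored.
        theta_hat[i][j] += (observations[i] - theta_hat[i][j]) / m[i][j]

M, Q = 2, 3  # hypothetical sizes: 2 users, 3 channels
theta_hat = [[0.0] * Q for _ in range(M)]
m = [[0] * Q for _ in range(M)]
update_estimates(theta_hat, m, assignment=(2, 0), observations=(1.0, 0.0))
update_estimates(theta_hat, m, assignment=(2, 1), observations=(0.0, 1.0))
```

After these two slots, user 0 has observed channel 2 twice (sample mean 0.5), while user 1 has observed channels 0 and 1 once each.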

Applying Algorithm 1, we get a linear storage policy for
which $(\hat{\theta}_{i,j})_{M \times Q}$ and $(m_{i,j})_{M \times Q}$ are stored and updated at
each time slot. The regret is polynomial in the number of users
and channels, and logarithmic in time. The computation
time for the policy is also polynomial, since the maximization step
in Algorithm 1 now becomes the following deterministic maximum weighted
bipartite matching problem



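$$\arg\max_{\mathbf{a} \in \mathcal{F}} \sum_{(i,j) \in \mathcal{A}_{\mathbf{a}}} \left( \hat{\theta}_{i,j} + \sqrt{\frac{(L+1) \ln n}{m_{i,j}}} \right) \qquad (26)$$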



on the bipartite graph of users and channels with edge weights
$\left( \hat{\theta}_{i,j} + \sqrt{\frac{(L+1) \ln n}{m_{i,j}}} \right)$. It can be solved in polynomial
computation time (e.g., using the Hungarian algorithm [20]).
Note that $L = \max_{\mathbf{a}} |\mathcal{A}_{\mathbf{a}}| = \min\{M, Q\}$ for this problem,
which is less than $M \times Q$, so the bound on regret is tighter.
The regret is $O(\min\{M, Q\}^3 MQ \log n)$ following Theorem 2.
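One decision step of this matching policy can be sketched as below. This is a hedged illustration: the function name `llr_matching_step` is hypothetical, every pair is assumed to have been observed at least once, and brute-force enumeration stands in for the Hungarian algorithm purely to keep the example self-contained (it does not run in polynomial time).

```python
import math
from itertools import permutations

def llr_matching_step(theta_hat, m, n):
    """Pick the matching that maximizes the sum of index weights in (26).

    Assumes every m[i][j] >= 1 (each pair observed once during initialization).
    Brute force stands in for the Hungarian algorithm in this sketch.
    """
    num_users, num_channels = len(theta_hat), len(theta_hat[0])
    L = min(num_users, num_channels)
    weight = [[theta_hat[i][j] + math.sqrt((L + 1) * math.log(n) / m[i][j])
               for j in range(num_channels)] for i in range(num_users)]
    return max(permutations(range(num_channels), num_users),
               key=lambda a: sum(weight[i][a[i]] for i in range(num_users)))

theta_hat = [[0.3, 0.8], [0.9, 0.2]]
# Equal counts: the exploration bonus is the same everywhere, so the
# matching with the best sample means is chosen.
exploit = llr_matching_step(theta_hat, [[10, 10], [10, 10]], n=40)
# Rarely observed pairs get a large bonus and are chosen for exploration.
explore = llr_matching_step(theta_hat, [[1, 50], [50, 1]], n=100)
```

The second call shows the exploration effect: the pairs observed only once carry a large bonus, so the policy plays the matching with the lower sample means.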

B. Shortest Path 

The Shortest Path (SP) problem is another example where
the underlying deterministic optimization can be done with
polynomial computation time. If the given directed graph is
denoted as $G = (V, E)$ with the source node $s$ and the
destination node $d$, and the cost (e.g., the transmission delay)
associated with edge $(i, j)$ is denoted as $D_{i,j} > 0$, the
objective is to find the path from $s$ to $d$ with the minimum sum
cost, i.e.,



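$$\min\ C^{SP}_{\mathbf{a}} = \sum_{(i,j) \in E} a_{i,j} D_{i,j} \qquad (27)$$

$$\text{s.t.}\quad a_{i,j} \in \{0, 1\},\ \forall (i,j) \in E \qquad (28)$$

$$\forall i,\quad \sum_{j} a_{i,j} - \sum_{j} a_{j,i} = \begin{cases} 1 & : i = s \\ -1 & : i = d \\ 0 & : \text{otherwise} \end{cases} \qquad (29)$$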



where equations (28) and (29) define a feasible set $\mathcal{F}$, such that
$\mathcal{F}$ is the set of all possible paths from $s$ to $d$. When $(D_{i,j})$ are
random variables with bounded support but unknown mean,
i.i.d. over time, a dynamic learning policy is needed for this
multi-armed bandit formulation.

Note that, corresponding to the LLR policy with the objective
of maximizing rewards, a direct variation of it is to find
the minimum linear cost defined on a finite constraint set $\mathcal{F}$
by changing the maximization problem into a minimization
problem. For clarity, this straightforward modification of LLR
is shown below in Algorithm 2, which we refer to as Learning
with Linear Costs (LLC).

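Algorithm 2 Learning with Linear Cost (LLC)

// Initialization part is same as in Algorithm 1
// Main loop
while 1 do
    n = n + 1;
    Play an arm a which solves the minimization problem

    $$\mathbf{a} = \arg\min_{\mathbf{a} \in \mathcal{F}} \sum_{i \in \mathcal{A}_{\mathbf{a}}} a_i \left( \hat{\theta}_i - \sqrt{\frac{(L+1) \ln n}{m_i}} \right); \qquad (30)$$

    Update $(\hat{\theta}_i)_{1 \times N}$, $(m_i)_{1 \times N}$ accordingly;
end while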



LLC (Algorithm 2) is a policy for a general multi-armed
bandit problem with linear cost defined on any constraint set.
It is directly derived from the LLR policy (Algorithm 1), so
Theorem 2 also holds for LLC, where the regret is defined as:



$$\mathfrak{R}^{\pi}_n(\Theta) = E^{\pi}\Big[ \sum_{t=1}^{n} C_{\pi(t)}(t) \Big] - n C^* \qquad (31)$$



where $C^*$ represents the minimum cost, which is the cost of the
optimal arm.
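The LLC loop can be sketched end to end over a small, explicitly enumerated arm set. This is a minimal sketch under stated assumptions: the initialization is assumed to mirror Algorithm 1 by playing, for each variable, some arm that observes it; the names `llc` and `sample_costs`, the Bernoulli costs, and the three example arms are all hypothetical; and the explicit `min` over arms stands in for a combinatorial solver.

```python
import math
import random

def llc(arms, sample_costs, horizon):
    """Run a sketch of the LLC loop over an explicit list of arms.

    arms: list of coefficient vectors a (tuples of length N, entries >= 0).
    sample_costs(): returns one realization of the N underlying variables.
    Returns (theta_hat, m, plays).
    """
    N = len(arms[0])
    L = max(sum(1 for a_i in a if a_i != 0) for a in arms)
    theta_hat, m, plays = [0.0] * N, [0] * N, []

    def play(a):
        x = sample_costs()
        for i in range(N):
            if a[i] != 0:  # only the variables used by the arm are observed
                m[i] += 1
                theta_hat[i] += (x[i] - theta_hat[i]) / m[i]
        plays.append(a)

    for p in range(N):  # initialization: observe every variable at least once
        play(next(a for a in arms if a[p] != 0))
    for n in range(N + 1, horizon + 1):  # main loop
        def index(a):
            return sum(a[i] * (theta_hat[i] - math.sqrt((L + 1) * math.log(n) / m[i]))
                       for i in range(N) if a[i] != 0)
        play(min(arms, key=index))
    return theta_hat, m, plays

rng = random.Random(0)
means = [0.2, 0.3, 0.9]                   # hypothetical unknown mean costs
arms = [(1, 1, 0), (0, 1, 1), (1, 0, 1)]  # expected costs 0.5, 1.2, 1.1
theta_hat, m, plays = llc(
    arms, lambda: [1.0 if rng.random() < mu else 0.0 for mu in means], 2000)
```

Only the length-N vectors `theta_hat` and `m` are stored, regardless of how many arms there are; in this example the lowest-cost arm `(1, 1, 0)` ends up being played in the large majority of slots.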

Using the LLC policy, we map each path between $s$ and $d$ to
an arm. The number of unknown variables is $|E|$, while the
number of arms could grow exponentially in the worst case.
Since there exist polynomial computation time algorithms such
as Dijkstra's algorithm [22] and the Bellman-Ford algorithm [23],
[24] for the shortest path problem, we can apply these
algorithms to solve (30) with edge cost $\hat{\theta}_i - \sqrt{\frac{(L+1) \ln n}{m_i}}$.
LLC is thus an efficient policy for the multi-armed bandit
formulation of the shortest path problem, with linear storage and
polynomial computation time. Note that $L = \max_{\mathbf{a}} |\mathcal{A}_{\mathbf{a}}| = |E|$.
The regret is $O(|E|^4 \log n)$.
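A single LLC decision step for the shortest path problem might look like the sketch below. It uses Bellman-Ford relaxation because the edge index can be negative once the exploration term is subtracted; the sketch therefore assumes an acyclic graph (so that no negative cycle can arise) and a destination reachable from the source. The function name `llc_shortest_path_step` and the three-edge example graph are hypothetical.

```python
import math

def llc_shortest_path_step(edges, theta_hat, m, n, source, dest):
    """Return the source-to-dest path minimizing the sum of LLC edge indices.

    edges: list of directed edges (u, v); theta_hat[e], m[e] index the same list.
    Assumes an acyclic graph, every m[e] >= 1, and dest reachable from source.
    """
    L = len(edges)  # L = |E|, as stated in the text
    index = [theta_hat[e] - math.sqrt((L + 1) * math.log(n) / m[e])
             for e in range(len(edges))]
    nodes = {u for u, _ in edges} | {v for _, v in edges}
    dist = {v: math.inf for v in nodes}
    pred = {}
    dist[source] = 0.0
    # Bellman-Ford: |V| - 1 rounds of relaxing every edge.
    for _ in range(len(nodes) - 1):
        for e, (u, v) in enumerate(edges):
            if dist[u] + index[e] < dist[v]:
                dist[v] = dist[u] + index[e]
                pred[v] = u
    path = [dest]
    while path[-1] != source:
        path.append(pred[path[-1]])
    return path[::-1]

edges = [("s", "a"), ("a", "d"), ("s", "d")]
theta_hat = [0.2, 0.2, 0.9]
well_sampled = llc_shortest_path_step(edges, theta_hat, [20, 20, 20], 100, "s", "d")
under_sampled = llc_shortest_path_step(edges, theta_hat, [20, 20, 1], 100, "s", "d")
```

When the direct edge has been observed only once, its index drops far enough that the policy plays it, even though its sample mean is the largest.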

Another related problem is the Shortest Path Tree (SPT),
where the problem formulation is similar, and the objective is to
find a subgraph of the given graph with the minimum total
cost between a selected root node $s$ and all other nodes. It is
expressed as [25], [26]:

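$$\min\ C^{SPT}_{\mathbf{a}} = \sum_{(i,j) \in E} a_{i,j} D_{i,j} \qquad (32)$$

$$\text{s.t.}\quad a_{i,j} \in \{0, 1\},\ \forall (i,j) \in E \qquad (33)$$

$$\sum_{(j,i) \in BS(i)} a_{j,i} - \sum_{(i,j) \in FS(i)} a_{i,j} = \begin{cases} -n + 1 & : i = s \\ 1 & : i \in V / \{s\} \end{cases} \qquad (34)$$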



where $BS(i) = \{(u, v) \in E : v = i\}$, $FS(i) = \{(u, v) \in E :
u = i\}$. (34) and (33) define the constraint set $\mathcal{F}$. We can
also use polynomial computation time algorithms such as
Dijkstra's algorithm and the Bellman-Ford algorithm to solve (30)
for the LLC policy.

C. Minimum Spanning Tree 

Minimum Spanning Tree (MST) is another combinatorial
optimization problem with polynomial computation time algorithms,
such as Prim's algorithm [27] and Kruskal's algorithm [28].
The objective for the MST problem can be simply presented
as

$$\min_{\mathbf{a} \in \mathcal{F}} C^{MST}_{\mathbf{a}} = \sum_{(i,j) \in E} a_{i,j} D_{i,j} \qquad (35)$$

where $\mathcal{F}$ is the set of all spanning trees in the graph.

With the LLC policy, each spanning tree is treated as an
arm, and $L = |E|$. The regret bound also grows as $O(|E|^4 \log n)$.
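The per-step minimization for the MST case can be sketched with Kruskal's algorithm, which stays valid when some edge indices are negative. This is an illustrative sketch: the function name `llc_spanning_tree_step`, the zero-based node numbering, and the triangle example are assumptions, and the graph is assumed connected with every edge observed at least once.

```python
import math

def llc_spanning_tree_step(num_nodes, edges, theta_hat, m, n):
    """Return edge positions of a spanning tree minimizing the summed LLC index.

    edges: list of undirected edges (u, v) with nodes numbered 0..num_nodes-1.
    Assumes the graph is connected and every m[e] >= 1.
    """
    L = len(edges)  # L = |E|, as in the text
    index = [theta_hat[e] - math.sqrt((L + 1) * math.log(n) / m[e])
             for e in range(len(edges))]
    parent = list(range(num_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    tree = []
    # Kruskal: scan edges by increasing index, keep those joining two components.
    for e in sorted(range(len(edges)), key=index.__getitem__):
        ru, rv = find(edges[e][0]), find(edges[e][1])
        if ru != rv:
            parent[ru] = rv
            tree.append(e)
    return sorted(tree)

edges = [(0, 1), (1, 2), (0, 2)]
theta_hat = [0.1, 0.2, 0.9]
well_sampled = llc_spanning_tree_step(3, edges, theta_hat, [30, 30, 30], 100)
under_sampled = llc_spanning_tree_step(3, edges, theta_hat, [30, 30, 1], 100)
```

With equal counts the two cheapest edges form the tree; when the third edge has been observed only once, its reduced index pulls it into the chosen tree.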

VII. Numerical Simulation Results 
Fig. 2. Simulation Results of a system with 7 orthogonal channels and 4
users.
Fig. 3. Simulation Results of a system with 9 orthogonal channels and 5
users.

In this section we present numerical simulation results
for the example of multiuser channel allocation in cognitive
radio networks.

Fig. 2 shows the simulation results of using the LLR policy
compared with the naive policy in Section IV-A. We assume that the
system consists of $Q = 7$ orthogonal channels and $M = 4$
secondary users. The throughput $\{S_{i,j}(t)\}_{t \ge 1}$ for each
user-channel combination is an i.i.d. Bernoulli process with mean
$\theta_{i,j}$ (unknown to the players), shown below:

(36)



where the components in the box are in the optimal arm.
Note that $P(7, 4) = 840$ while $7 \times 4 = 28$, so the storage
used by the naive approach is 30 times that of the LLR
policy. Fig. 2 shows the regret (normalized with respect to the
logarithm of time) over time for the naive policy and the LLR
policy. Under both policies the regret grows
logarithmically in time, but the regret for the naive policy is
much higher than that of the LLR policy.

Fig. 3 shows another example, with $Q = 9$ and $M = 5$.
The throughput is also assumed to be an i.i.d. Bernoulli
process, with the following mean:

$(\theta_{i,j}) =$
(37)



For this example, $P(9, 5) = 15120$, which is much higher
than $9 \times 5 = 45$ (336 times higher), so the storage
used by the naive policy grows much faster than that of the LLR
policy. Comparing the regrets shown in Table II for both
examples when $t = 2 \times 10^6$, we can see that the regret also
grows much faster for the naive policy.
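The arm counts and storage ratios quoted for the two examples can be reproduced with a few lines of arithmetic. The helper name `storage_ratio` is an illustrative assumption; `math.perm` requires Python 3.8 or later.

```python
import math

def storage_ratio(Q, M):
    """Arms P(Q, M) for the naive policy versus Q*M entries for LLR."""
    num_arms = math.perm(Q, M)   # number of one-to-one user-channel matchings
    num_variables = Q * M        # number of unknown means stored by LLR
    return num_arms, num_variables, num_arms / num_variables

print(storage_ratio(7, 4))  # (840, 28, 30.0)
print(storage_ratio(9, 5))  # (15120, 45, 336.0)
```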



TABLE II
Regret when $t = 2 \times 10^6$

                         Naive Policy    LLR
7 channels, 4 users      2443.6          163.6
9 channels, 5 users      24892.6         345.2



VIII. K Simultaneous Actions 

The reward-maximizing LLR policy presented in Algorithm
1 and the corresponding cost-minimizing LLC policy
presented in Algorithm 2 can also be extended to the setting where $K$ arms
are played at each time slot. The goal is to maximize the total
reward (or minimize the total cost) obtained by these $K$
arms. For brevity, we only present the policy for the
reward-maximization problem; the extension to cost-minimization is
straightforward. The modified LLR-K policy for picking the
$K$ best arms is shown in Algorithm 3.

Theorem 3 states the upper bound on the regret for the
extended LLR-K policy.
Algorithm 3 Learning with Linear Rewards while selecting
K arms (LLR-K)

// Initialization part is same as in Algorithm 1
// Main loop
while 1 do
    n = n + 1;
    Play arms $\{\mathbf{a}\}_K \in \mathcal{F}$ with $K$ largest values in (38)

    $$\sum_{i \in \mathcal{A}_{\mathbf{a}}} a_i \left( \hat{\theta}_i + \sqrt{\frac{(L+1) \ln n}{m_i}} \right); \qquad (38)$$

    Update $(\hat{\theta}_i)_{1 \times N}$, $(m_i)_{1 \times N}$ for all arms accordingly;
end while
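The selection step of LLR-K can be sketched as follows, assuming that (38) is the same index used by LLR and that the arm set is small enough to enumerate explicitly. The function name `llr_k_step` and the three example arms are hypothetical; a practical implementation would replace the enumeration with a K-best combinatorial solver.

```python
import heapq
import math

def llr_k_step(arms, theta_hat, m, n, K):
    """Return the K arms with the largest LLR index values, as in (38).

    arms: explicit list of coefficient vectors (tuples of length N).
    Assumes every m[i] >= 1.
    """
    N = len(theta_hat)
    L = max(sum(1 for a_i in a if a_i != 0) for a in arms)

    def index(a):
        return sum(a[i] * (theta_hat[i] + math.sqrt((L + 1) * math.log(n) / m[i]))
                   for i in range(N) if a[i] != 0)

    # nlargest returns the K best arms in decreasing order of index.
    return heapq.nlargest(K, arms, key=index)

arms = [(1, 1, 0), (0, 1, 1), (1, 0, 1)]
chosen = llr_k_step(arms, theta_hat=[0.2, 0.5, 0.9], m=[10, 10, 10], n=50, K=2)
```

With equal counts the exploration bonus is identical for every arm, so the two arms with the largest summed sample means are selected.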



Theorem 3: The expected regret under the LLR-K policy
with K arms selection is at most

$$\left[ \frac{4 a_{\max}^2 L^2 (L+1) N \ln n}{(\Delta_{\min})^2} + N + \frac{\pi^2}{3} L K^{2L} N \right] \Delta_{\max}. \qquad (39)$$



Proof:

The proof is similar to the proof of Theorem 2, but now we
have a set of $K$ arms with the $K$ largest expected rewards as the
optimal arms. We denote this set as $\mathcal{A}^* = \{\mathbf{a}^{*,k}, 1 \le k \le K\}$,
where $\mathbf{a}^{*,k}$ is the arm with the $k$-th largest expected reward. As
in the proof of Theorem 2, we define $T_i(n)$ as a counter that is
incremented, in the same way, when a non-optimal arm is played.
The corresponding equations from that proof still hold.

Note that each time when $I_i(t) = 1$, there exists some
arm such that a non-optimal arm is picked for which $m_i$ is
the minimum in this arm. We denote this arm as $\mathbf{a}(t)$. Note
that $\mathbf{a}(t)$ means there exists $m$, $1 \le m \le K$, such that the
following holds:



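$$T_i(n) \le l + \sum_{t=N}^{n} \mathbb{1}\Big\{ \sum_{j \in \mathcal{A}_{\mathbf{a}^{*,m}}} a^{*,m}_j \big( \bar{\theta}_{j,m_j}(t) + C_{t,m_j}(t) \big) \le \sum_{j \in \mathcal{A}_{\mathbf{a}(t)}} a_j(t) \big( \bar{\theta}_{j,m_j}(t) + C_{t,m_j}(t) \big),\ T_i(t) \ge l \Big\}. \qquad (40)$$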



Since $K$ arms are played at each time, a random variable
could be observed up to $Kt$ times by time $t$. Then (14)
should be modified as:



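$$T_i(n) \le l + \sum_{t=1}^{\infty} \sum_{m_{h_1}=1}^{Kt} \cdots \sum_{m_{h_{|\mathcal{A}_{\mathbf{a}^{*,m}}|}}=1}^{Kt} \ \sum_{m_{p_1}=l}^{Kt} \cdots \sum_{m_{p_{|\mathcal{A}_{\mathbf{a}(t)}|}}=l}^{Kt} \mathbb{1}\Big\{ \sum_{j=1}^{|\mathcal{A}_{\mathbf{a}^{*,m}}|} a^{*,m}_{h_j} \big( \bar{\theta}_{h_j,m_{h_j}} + C_{t,m_{h_j}} \big) \le \sum_{j=1}^{|\mathcal{A}_{\mathbf{a}(t)}|} a_{p_j}(t) \big( \bar{\theta}_{p_j,m_{p_j}} + C_{t,m_{p_j}} \big) \Big\}. \qquad (41)$$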



Equations (13) to (20) are similar, substituting $\mathbf{a}^*$ with
$\mathbf{a}^{*,m}$. So, we have:



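$$E[T_i(n)] \le \left\lceil \frac{4 (L+1) \ln n}{\left( \frac{\Delta_{\min}}{L a_{\max}} \right)^2} \right\rceil + \sum_{t=1}^{\infty} \left( \sum_{m_{h_1}=1}^{Kt} \cdots \sum_{m_{h_{|\mathcal{A}_{\mathbf{a}^{*,m}}|}}=1}^{Kt} \ \sum_{m_{p_1}=l}^{Kt} \cdots \sum_{m_{p_{|\mathcal{A}_{\mathbf{a}(t)}|}}=l}^{Kt} 2 L t^{-2(L+1)} \right)$$

$$\le \frac{4 a_{\max}^2 L^2 (L+1) \ln n}{(\Delta_{\min})^2} + 1 + \frac{\pi^2}{3} L K^{2L}. \qquad (42)$$

Hence, we get the upper bound for the regret as:

$$\mathfrak{R}^{\pi}_n(\Theta) \le \left[ \frac{4 a_{\max}^2 L^2 (L+1) N \ln n}{(\Delta_{\min})^2} + N + \frac{\pi^2}{3} L K^{2L} N \right] \Delta_{\max}. \qquad (43)$$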
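The factor $K^{2L}$ in the bound can be traced with a short calculation. The sketch below assumes, as in the proof of Theorem 2, that each term of the nested sum is bounded by $2L t^{-2(L+1)}$; since there are at most $2L$ nested sums, each ranging over at most $Kt$ values, the nested sum has at most $(Kt)^{2L}$ terms.

```latex
\begin{align*}
\sum_{t=1}^{\infty} (Kt)^{2L} \cdot 2L\, t^{-2(L+1)}
  &= 2L K^{2L} \sum_{t=1}^{\infty} t^{-2} \\
  &= 2L K^{2L} \cdot \frac{\pi^2}{6}
   = \frac{\pi^2}{3} L K^{2L}.
\end{align*}
```

Setting $K = 1$ recovers the constant term $\frac{\pi^2}{3} L$ obtained when a single arm is played per slot.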



IX. Conclusion 

We have considered multi-armed bandit problems that pro- 
vide for arms with rewards that are a linear function of a 
smaller set of random variables with unknown means. For 
such problems, if the number of arms is exponentially large 
in the number of underlying random variables, existing arm-based
index policies such as the well-known UCB1 [5] have
poor performance in terms of storage, computation, and regret.
The LLR and LLC policies we have presented are smarter
in that they store and make decisions at each time based 
on the stochastic observations of the underlying unknown- 
mean random variables alone; they require only linear storage 
and result in a regret that is bounded by a polynomial 
function of the number of unknown-mean random variables. If 
the deterministic version of the corresponding combinatorial 
optimization problem can be solved in polynomial time, our 
policy will also require only polynomial computation per step. 
We have shown a number of problems in the context of 
networks where this formulation would be useful, including 
maximum-weight matching, shortest path and spanning tree 
computations. 

While this work has provided useful insights into real-world 
linear combinatorial optimization with unknown-mean random 
coefficients, there are many interesting open problems to be 
explored in the future. One open question is to derive a lower 
bound on the regret achievable by any policy for this problem. 
We conjecture on intuitive grounds that it is not possible to
have regret lower than $\Omega(N \log n)$, but this remains to be
proved rigorously. It is unclear whether the lower bound can
be any higher than this, and hence, it is unclear whether it is
possible to prove an upper bound on regret for some policy
that is better than the $O(N^4 \log n)$ upper bound shown in our
work.

In the context of channel access in cognitive radio networks,
other researchers have recently developed distributed policies
in which different users each select an arm independently [3],
[4]. A closely related problem in this setting would be to
have distributed users selecting different elements of the
action vector independently. The design and analysis of such
distributed policies is an open problem.

Finally, it would be of great interest to see if it is possible to 
also tackle non-linear reward functions, at least in structured 
cases that have proved to be tractable in deterministic settings, 
such as convex functions. 